HEAD ======= >>>>>>> a07efb2cd47cdcc995273873e0718466ba47033c
This document presents workshop materials to get you started on generating forecasts specifically for submission to the EFI-NEON Forecasting Challenge. The Challenge goal is to create a community of practice that builds capacity for ecological forecasting by leveraging NEON data products. The Challenge revolves around the five theme areas that span aquatic and terrestrial systems, and population, community, and ecosystem processes across a broad range of ecoregions that uses data collected by NEON. Learn more about the Challenge here!
The development of these materials has been supported by NSF grants DEB-1926388 and DBI-1933016.
To complete the workshop via this markdown document the following packages will need to be installed:
remotesfpp3tsibbletidyverselubridateneon4cast (from github)The following code chunk should be run to install packages.
#install.packages('remotes')
#install.packages('fpp3') # package for applying simple forecasting methods
#install.packages('tsibble') # package for dealing with time series data sets and tsibble objects
#install.packages('tidyverse') # collection of R packages for data manipulation, analysis, and visualisation
#install.packages('lubridate') # working with dates and times
remotes::install_github('eco4cast/neon4cast') # package from NEON4cast challenge organisers to assist with forecast building and submission
Additionally, R version 4.2 is required to run the neon4cast package. It’s also worth checking your Rtools is up to date and compatible with R 4.2, see (https://cran.r-project.org/bin/windows/Rtools/rtools42/rtools.html).
version$version.string
<<<<<<< HEAD
## [1] "R version 4.3.2 (2023-10-31 ucrt)"
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
=======
## [1] "R version 4.3.2 (2023-10-31)"
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
>>>>>>> a07efb2cd47cdcc995273873e0718466ba47033c
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
If you do not wish to run the code yourself you can follow along via
the html (NEON_forecast_challenge_workshop.md), which can be downloaded
from the Github
repository.
The EFI RCN NEON Forecast Challenge asks the scientific community to produce ecological forecasts of future conditions at NEON sites by leveraging NEON’s open data products. The Challenge is split into five themes that span aquatic and terrestrial systems, and population, community, and ecosystem processes across a broad range of ecoregions. We are excited to use this Challenge to learn more about the predictability of ecological processes by forecasting NEON data before it is collected.
Which modeling frameworks, mechanistic processes, and statistical approaches best capture community, population, and ecosystem dynamics? These questions are answerable by a community generating a diverse array of forecasts. The Challenge is open to any individual or team from anywhere around the world that wants to submit forecasts. Sign up here..
What: Freshwater surface water temperature, oxygen, and chlorophyll-a.
Where: 7 lakes and 27 river/stream NEON sites.
When: Daily forecasts for at least 30-days in the future. New forecast submissions, that use new data to update the forecast, are accepted daily. The only requirement is that submissions are predictions of the future at the time the forecast is submitted.
Today we will focus on lake sites only and will start with forecasting water temperature. For the challenge, you can chose to submit to either the lakes, rivers or streams or all three! You can also chose to submit any of the three focal variables (temperature, oxygen, and chlorophyll). Find more information about the aquatics challenge here.
For the Challange, forecasts must include quantified uncertainty. The file can represent uncertainty using an ensemble forecast (multiple realizations of future conditions) or a distribution forecast (with mean and standard deviation), specified in the family and parameter columns of the forecast file.
For an ensemble forecast, the family column uses the
word ensemble to designate that it is a ensemble forecast
and the parameter column is the ensemble member number (1, 2, 3 …). For
a distribution forecast, the family column uses the word
normal to designate a normal distribution and the parameter
column must have the words mu and sigma for each forecasted variable,
site_id, and datetime. For forecasts that don’t have a normal
distribution we recommend using the ensemble format and sampling from
your non-normal distribution to generate a set of ensemble members that
represents your distribution. I will go through examples of both
ensemble and normal forecasts as examples.
The full list of required columns and format can be found in the Challenge documentation.
We start forecasting by first looking at the historic data - called the ‘targets’. These data are available near real-time, with the latency of approximately 24-48 hrs. Here is how you read in the data from the targets file available from the EFI server.
#read in the targets data
targets <- read_csv('https://data.ecoforecast.org/neon4cast-targets/aquatics/aquatics-targets.csv.gz')
Information on the NEON sites can be found in the
NEON_Field_Site_Metadata_20220412.csv file on GitHub. It
can be filtered to only include aquatic sites. This table has
information about the field sites, including location, ecoregion,
information about the watershed (e.g. elevation, mean annual
precipitation and temperature), and lake depth.
# read in the sites data
aquatic_sites <- read_csv("https://raw.githubusercontent.com/eco4cast/neon4cast-targets/main/NEON_Field_Site_Metadata_20220412.csv") |>
dplyr::filter(aquatics == 1)
Let’s take a look at the targets data!
## # A tibble: 11 × 4
## datetime site_id variable observation
## <date> <chr> <chr> <dbl>
## 1 2017-04-23 ARIK temperature 12.7
## 2 2017-04-24 ARIK chla NA
## 3 2017-04-24 ARIK oxygen 7.10
## 4 2017-04-24 ARIK temperature 14.8
## 5 2017-04-25 ARIK chla NA
## 6 2017-04-25 ARIK oxygen 6.58
## 7 2017-04-25 ARIK temperature 15.5
## 8 2017-04-26 ARIK chla NA
## 9 2017-04-26 ARIK oxygen 7.32
## 10 2017-04-26 ARIK temperature 13.0
## 11 2017-04-27 ARIK chla NA
The columns of the targets file show the time step (daily for
aquatics challenge), the 4 character site code (site_id),
the variable being measured, and the mean daily observation. To look at
only the lakes we can subset the targets and aquatic sites to those
which have the field_site_subtype of Lake.
lake_sites <- aquatic_sites %>%
filter(field_site_subtype == 'Lake')
targets <- targets %>%
filter(site_id %in% lake_sites$field_site_id)
Figure: Temperature targets data at aquatic sites provided by EFI for the NEON forecasting challgenge
Figure: Oxygen targets data at aquatic sites provided by EFI for the NEON forecasting challgenge
Figure: Chlorophyll targets data at aquatic sites provided by EFI for the NEON forecasting challgenge. Chlorophyll data is only available at lake and river sites
We can think about what type of models might be useful to predict these variables at these sites. Below are descriptions of three simple models which have been constructed to get you started forecasting:
To start, we will produce forecasts for just one of these target variables, surface water temperature.
targets <- targets %>%
filter(variable == 'temperature')
One important step to overcome when thinking about generating forecasts is to include co-variates in the model. A water temperature forecast, for example, may be benefit from information about past and future weather. The neon4cast challenge package includes functions for downloading past and future NOAA weather forecasts for all of the NEON sites. The 3 types of data are as follows:
This code create a connection to the dataset hosted on the eco4cast
server (neon4cast-drivers/noaa/gefs-v12) using
arrow functions. To download the data you have to tell the
function to collect() it. These data set can be subsetted
and filtered using dplyr functions prior to download to
limit the memory usage.
You can read more about the NOAA forecasts available for the NEON sites here:
We will generate a water temperature forecast using
air_temperature as a co-variate. Note: This code chunk can
take a few minutes to execute as it accesses the NOAA data.
# past stacked weather
noaa_past_s3 <- neon4cast::noaa_stage3()
variables <- c("air_temperature", "air_pressure")
#Other variable names can be found at https://projects.ecoforecast.org/neon4cast-docs/Shared-Forecast-Drivers.html#stage-3
noaa_past <- noaa_past_s3 |>
dplyr::filter(site_id %in% lake_sites$field_site_id,
datetime >= ymd('2017-01-01'),
variable %in% variables) |>
dplyr::collect()
noaa_past
<<<<<<< HEAD
## # A tibble: 12,905,858 × 7
## parameter datetime variable prediction family reference_datetime
## <dbl> <dttm> <chr> <dbl> <chr> <lgl>
## 1 0 2020-09-24 00:00:00 air_press… 95405. ensem… NA
## 2 1 2020-09-24 00:00:00 air_press… 95392. ensem… NA
## 3 2 2020-09-24 00:00:00 air_press… 95415. ensem… NA
## 4 3 2020-09-24 00:00:00 air_press… 95416. ensem… NA
## 5 4 2020-09-24 00:00:00 air_press… 95399. ensem… NA
## 6 5 2020-09-24 00:00:00 air_press… 95380. ensem… NA
## 7 6 2020-09-24 00:00:00 air_press… 95407. ensem… NA
## 8 7 2020-09-24 00:00:00 air_press… 95396. ensem… NA
## 9 8 2020-09-24 00:00:00 air_press… 95483. ensem… NA
## 10 9 2020-09-24 00:00:00 air_press… 95378. ensem… NA
## # ℹ 12,905,848 more rows
=======
## # A tibble: 13,051,682 × 7
## parameter datetime variable prediction family reference_datetime
## <dbl> <dttm> <chr> <dbl> <chr> <lgl>
## 1 1 2020-11-07 01:00:00 air_press… 95326. ensem… NA
## 2 2 2020-11-07 01:00:00 air_press… 95333. ensem… NA
## 3 3 2020-11-07 01:00:00 air_press… 95347. ensem… NA
## 4 4 2020-11-07 01:00:00 air_press… 95388. ensem… NA
## 5 5 2020-11-07 01:00:00 air_press… 95326. ensem… NA
## 6 6 2020-11-07 01:00:00 air_press… 95278. ensem… NA
## 7 7 2020-11-07 01:00:00 air_press… 95386. ensem… NA
## 8 8 2020-11-07 01:00:00 air_press… 95318. ensem… NA
## 9 9 2020-11-07 01:00:00 air_press… 95384. ensem… NA
## 10 10 2020-11-07 01:00:00 air_press… 95360. ensem… NA
## # ℹ 13,051,672 more rows
>>>>>>> a07efb2cd47cdcc995273873e0718466ba47033c
## # ℹ 1 more variable: site_id <chr>
This is a stacked ensemble forecast of the one day ahead forecasts.
To get an estimate of the historic conditions we can take a mean of
these ensembles. We will also need to convert the temperatures to
Celsius from Kelvin.
# aggregate the past to mean values
noaa_past_mean <- noaa_past |>
mutate(datetime = as_date(datetime)) |>
group_by(datetime, site_id, variable) |>
summarize(prediction = mean(prediction, na.rm = TRUE), .groups = "drop") |>
pivot_wider(names_from = variable, values_from = prediction) |>
# convert air temp to C
mutate(air_temperature = air_temperature - 273.15)
We can then look at the future weather forecasts in the same way but
using the noaa_stage2(). The forecast becomes available
from NOAA at 5am UTC the following day, so we take the air temperature
forecast from yesterday (noaa_date) to make the water
quality forecasts. Then we can use the ensembles to produce uncertainty
in the water temperature forecast by forecasting multiple (31) future
water temperatures.
# New forecast only available at 5am UTC the next day
forecast_date <- Sys.Date()
<<<<<<< HEAD
noaa_date <- forecast_date - days(1)
=======
noaa_date <- forecast_date - days(2)
>>>>>>> a07efb2cd47cdcc995273873e0718466ba47033c
noaa_future_s3 <- neon4cast::noaa_stage2(start_date = as.character(noaa_date))
variables <- c("air_temperature", "air_pressure")
noaa_future <- noaa_future_s3 |>
dplyr::filter(datetime >= forecast_date,
site_id %in% lake_sites$field_site_id,
variable %in% variables) |>
collect()
The forecasts are hourly and we are interested in using daily mean air temperature for water temperature forecast generation.
noaa_future_daily <- noaa_future |>
mutate(datetime = as_date(datetime)) |>
# mean daily forecasts at each site per ensemble
group_by(datetime, site_id, parameter, variable) |>
summarize(prediction = mean(prediction)) |>
pivot_wider(names_from = variable, values_from = prediction) |>
# convert to Celsius
mutate(air_temperature = air_temperature - 273.15) |>
select(datetime, site_id, air_temperature, air_pressure, parameter)
## `summarise()` has grouped output by 'datetime', 'site_id', 'parameter'. You can
## override using the `.groups` argument.
noaa_future_daily
<<<<<<< HEAD
## # A tibble: 7,595 × 5
## # Groups: datetime, site_id, parameter [7,595]
## datetime site_id air_temperature air_pressure parameter
## <date> <chr> <dbl> <dbl> <dbl>
## 1 2024-02-15 BARC 15.1 101914. 0
## 2 2024-02-15 BARC 14.7 101850. 1
## 3 2024-02-15 BARC 15.2 101891. 2
## 4 2024-02-15 BARC 15.2 101849. 3
## 5 2024-02-15 BARC 15.2 101940. 4
## 6 2024-02-15 BARC 14.8 101963. 5
## 7 2024-02-15 BARC 15.2 101942. 6
## 8 2024-02-15 BARC 14.7 101836. 7
## 9 2024-02-15 BARC 15.4 101909. 8
## 10 2024-02-15 BARC 14.9 101899. 9
## # ℹ 7,585 more rows
Now we have a timeseries of historic data and a 30 member ensemble forecast of future air temperatures
## # A tibble: 7,378 × 5
## # Groups: datetime, site_id, parameter [7,378]
## datetime site_id air_temperature air_pressure parameter
## <date> <chr> <dbl> <dbl> <dbl>
## 1 2024-02-29 BARC 17.0 102039. 0
## 2 2024-02-29 BARC 16.7 102049. 1
## 3 2024-02-29 BARC 16.8 101996. 2
## 4 2024-02-29 BARC 16.3 101998. 3
## 5 2024-02-29 BARC 16.8 102152. 4
## 6 2024-02-29 BARC 17.1 102049. 5
## 7 2024-02-29 BARC 15.9 102064. 6
## 8 2024-02-29 BARC 16.8 102026. 7
## 9 2024-02-29 BARC 16.6 102074. 8
## 10 2024-02-29 BARC 17.5 102075. 9
## # ℹ 7,368 more rows
Now we have a timeseries of historic data and a 30 member ensemble forecast of future air temperatures
Figure: historic and future NOAA air temeprature forecasts at lake sites
Figure: last two months of historic air temperature forecasts and 35 day ahead forecast
Figure: historic and future NOAA air temeprature forecasts at lake sites
We will fit a simple linear model between historic air temperature and the water temperature targets data. Using this model we can then use our future estimates of air temperature (all 30 ensembles) to estimate water temperature at each site. The ensemble weather forecast will therefore propagate uncertainty into the water temperature forecast and give an estimate of driving data uncertainty.
We will start by joining the historic weather data with the targets to aid in fitting the linear model.
targets_lm <- targets |>
filter(variable == 'temperature') |>
pivot_wider(names_from = 'variable', values_from = 'observation') |>
left_join(noaa_past_mean,
by = c("datetime","site_id"))
targets_lm[1000:1010,]
## # A tibble: 11 × 5
## datetime site_id temperature air_pressure air_temperature
## <date> <chr> <dbl> <dbl> <dbl>
## 1 2020-10-16 BARC NA 101197. 25.5
## 2 2020-10-17 BARC NA 101555. 22.6
## 3 2020-10-18 BARC NA 101603. 23.6
## 4 2020-10-19 BARC NA 101471. 24.1
## 5 2020-10-20 BARC NA 101315. 24.7
## 6 2020-10-21 BARC NA 101294. 24.4
## 7 2020-10-22 BARC NA 101313. 23.8
## 8 2020-10-23 BARC NA 101265. 23.7
## 9 2020-10-24 BARC NA 101094. 24.0
## 10 2020-10-25 BARC NA 100846. 24.9
## 11 2020-10-26 BARC NA 101128. 24.9
#Deterministic Forecast
Now we will generate a deterministic forecast with our model. We will use one ensemble member from the NOAA GEFS air temperature forecast ensemble as input to our multiple linear regression model, thus producing a water temperature prediction for 1 to 7 days into the future with no uncertainty.
Set the number of ensemble members; this is set to 1 because we are making a deterministic forecast.
n_members <- 1
Setting up an empty dataframe that we will fill with our water
temperature predictions. Here, the mutate() function is
used to insert the current observed water temperature as the initial
condition and set all future values of water temperature to NA, which
will subsequently be replaced with forecasted values by our model.
Run forecast. Here, we loop through days into the future and generate
predictions with our multiple regression model using yesterday’s water
temperature and today’s air temperature. Note we use the
rows_update() function to replace NAs with forecasted water
temperature values each day.
temp_deterministic_forecast <- NULL
#Set forecast horizon
forecast_horizon <- 30
forecasted_dates <- seq(from = ymd(forecast_date), to = ymd(forecast_date) + forecast_horizon, by = "day")
#Looping through sites
for(n in 1:length(lake_sites$field_site_id)) {
example_site <- lake_sites$field_site_id[n]
#Set up an empty dataframe to fill up the forecast data later
forecast_deterministic <- tibble(forecast_date = rep(forecasted_dates, times = n_members),
ensemble_member = rep(1:n_members, each = length(forecasted_dates)),
forecast_variable = "water_temperature",
site_id = example_site,
value = as.double(NA),
uc_type = "deterministic")
#Start off by fitting a linear model
linear_fit2 <- lm(targets_lm$temperature ~ targets_lm$air_temperature + targets_lm$air_pressure)
#Looping through forecast horizon
for(o in 1:length(forecasted_dates)) {
#pull prediction dataframe for the relevant date
temp_pred <- forecast_deterministic %>%
filter(forecast_date == forecasted_dates[o])
#pull driver data for the relevant date; here we select only 1 ensemble member from the NOAA air temperature forecast
temp_driv <- noaa_future_daily %>%
filter(datetime == forecasted_dates[o] & parameter == 1 & site_id == example_site)
#pull lagged water temp values
#temp_lag <- forecast_deterministic %>%
#filter(forecast_date == forecasted_dates[o-1])
#run model
temp_pred$value <- linear_fit2$coefficients[1] + temp_driv$air_temperature * linear_fit2$coefficients[2] + temp_driv$air_pressure * linear_fit2$coefficients[3]
#insert values back into the forecast dataframe
forecast_deterministic <- forecast_deterministic %>%
rows_update(temp_pred, by = c("forecast_date","ensemble_member" ,"forecast_variable", "site_id", "uc_type"))
}
temp_deterministic_forecast <- dplyr::bind_rows(temp_deterministic_forecast, forecast_deterministic)
}
Build plot. This should resemble the plot in the R Shiny app Activity B Overview, labeled “Water Temperature Forecast”; here, we are plotting the “Both” model, which uses both yesterday’s water temperature and today’s forecasted air temperature to forecast water temperature.
ggplot()+
geom_line(data = temp_deterministic_forecast, aes(x=forecast_date, y=value, group = ensemble_member, alpha = 0.4)) +
geom_vline(xintercept = as_date(forecast_date), linetype = "dashed") +
labs(title = "Deterministic forecast of water temperature for different NEON sites")+
ylab("water temperature (degC)")+
scale_x_date(date_breaks = "1 week", date_labels = "%b %d")+
facet_wrap(~site_id, scales = 'free')
<<<<<<< HEAD
#Probabilistic Forecast
To fit the linear model we use the base R lm() but there
are also methods to fit linear (and non-linear) models in the
fable:: package. You can explore the documentation for
more information on the fable::TSLM() function. We can fit
a separate linear model for each site. For example, at Lake Suggs, this
would look like:
#Set ensemble members constant for all the sites
n_members <- 31
temp_driver_forecast <- NULL
temp_parameter_forecast <- NULL
temp_process_forecast <- NULL
temp_total_forecast <- NULL
#Looping through sites
for(i in 1:length(lake_sites$field_site_id)) {
example_site <- lake_sites$field_site_id[i]
#Create empty dataframes for driver, parameter and process uncertainty to fill up later
forecast_driver_unc <- tibble(forecast_date = rep(forecasted_dates, times = n_members),
ensemble_member = rep(1:n_members, each = length(forecasted_dates)),
forecast_variable = "water_temperature",
site_id = example_site,
value = as.double(NA),
uc_type = "driver")
forecast_parameter_unc <- tibble(forecast_date = rep(forecasted_dates, times = n_members),
ensemble_member = rep(1:n_members, each = length(forecasted_dates)),
forecast_variable = "water_temperature",
site_id = example_site,
value = as.double(NA),
uc_type = "parameter")
forecast_process_unc <- tibble(forecast_date = rep(forecasted_dates, times = n_members),
ensemble_member = rep(1:n_members, each = length(forecasted_dates)),
forecast_variable = "water_temperature",
site_id = example_site,
value = as.double(NA),
uc_type = "process")
#Create an empty dataframe to assemble total forecast uncertainties
forecast_total_unc <- tibble(forecast_date = rep(forecasted_dates, times = n_members),
ensemble_member = rep(1:n_members, each = length(forecasted_dates)),
forecast_variable = "water_temperature",
site_id = example_site,
value = as.double(NA),
uc_type = "total")
#Create everything needed for chosen site to use inside date loop
#Modeling section
#Start off by fitting a linear model
linear_fit <- lm(targets_lm$temperature ~ targets_lm$air_temperature + targets_lm$air_pressure)
fit_summary <- summary(linear_fit)
#Store coefficients and parameters
coeffs <- round(linear_fit$coefficients, 2)
params_se <- fit_summary$coefficients[,2]
#Calculate model predictions using above model
mod <- predict(linear_fit, data = targets_lm)
#Create distribution table to use for parameter uncertainty later
param_df <- data.frame(beta1 = rnorm(n_members, coeffs[1], params_se[1]),
beta2 = rnorm(n_members, coeffs[2], params_se[2]),
beta3 = rnorm(n_members, coeffs[3], params_se[3]))
#Calculate model assessment matrices
r2 <- round(fit_summary$r.squared, 2)
residuals <- mod - targets_lm$temperature
err <- mean(residuals, na.rm = TRUE)
rmse <- round(sqrt(mean((mod - targets_lm$temperature)^2, na.rm = TRUE)), 2)
#Create sigma distribution to use for process uncertainty later
sigma <- sd(residuals, na.rm = TRUE)
#Forecasting section for driver uncertainty
for(j in 1:length(forecasted_dates)) {
#pull prediction dataframe for the relevant date
temp_pred <- forecast_driver_unc %>%
filter(forecast_date == forecasted_dates[j])
#pull driver ensemble for the relevant date; here we are using all 30 NOAA ensemble members
temp_driv <- noaa_future_daily %>%
filter(datetime == forecasted_dates[j] & site_id == example_site)
na.omit(temp_driv)
#pull lagged water temp values
#temp_lag <- forecast_driver_unc %>%
#filter(forecast_date == forecasted_dates[j-1])
#run model
temp_pred$value <- coeffs[1] + temp_driv$air_temperature * coeffs[2] + temp_driv$air_pressure * coeffs[3]
#insert values back into the forecast dataframe
forecast_driver_unc <- forecast_driver_unc %>%
rows_update(temp_pred, by = c("forecast_date","ensemble_member","forecast_variable","site_id","uc_type"))
}
#Forecasting section for parameter uncertainty
for(k in 1:length(forecasted_dates)) {
#pull prediction dataframe for the relevant date
temp_pred <- forecast_parameter_unc %>%
filter(forecast_date == forecasted_dates[k])
#pull driver data for the relevant date; here we select only 1 ensemble member from the NOAA air temperature forecast
temp_driv <- noaa_future_daily %>%
filter(datetime == forecasted_dates[k] & site_id == example_site & parameter == 1)
na.omit(temp_driv)
#pull lagged water temp values
#temp_lag <- forecast_parameter_unc %>%
# filter(forecast_date == forecasted_dates[k-1])
#run model using parameter distributions instead of fixed values
temp_pred$value <- param_df$beta1 + temp_driv$air_temperature * param_df$beta2 + temp_driv$air_pressure * param_df$beta3
#insert values back into the forecast dataframe
forecast_parameter_unc <- forecast_parameter_unc %>%
rows_update(temp_pred, by = c("forecast_date","ensemble_member","forecast_variable","site_id","uc_type"))
}
##Forecasting section for process uncertainty
for(l in 1:length(forecasted_dates)) {
#pull prediction dataframe for the relevant date
temp_pred <- forecast_process_unc %>%
filter(forecast_date == forecasted_dates[l])
#pull driver data for the relevant date; here we select only 1 ensemble member from the NOAA air temperature forecast
temp_driv <- noaa_future_daily %>%
filter(datetime == forecasted_dates[l] & parameter == 1 & site_id == example_site)
na.omit(temp_driv)
#pull lagged water temp values
#temp_lag <- forecast_process_unc %>%
#filter(forecast_date == forecasted_dates[l-1])
#run model with process uncertainty added
temp_pred$value <- coeffs[1] + temp_driv$air_temperature * coeffs[2] + temp_driv$air_pressure * coeffs[3] + rnorm(n = n_members, mean = 0, sd = sigma)
#insert values back into the forecast dataframe
forecast_process_unc <- forecast_process_unc %>%
rows_update(temp_pred, by = c("forecast_date","ensemble_member","forecast_variable","site_id","uc_type"))
}
#Forecasting section for total forecast uncertainty
for(m in 1:length(forecasted_dates)) {
#pull prediction dataframe for relevant date
temp_pred <- forecast_total_unc %>%
filter(forecast_date == forecasted_dates[m])
#pull driver ensemble for the relevant date; here we are using all 30 NOAA ensemble members
temp_driv <- noaa_future_daily %>%
filter(datetime == forecasted_dates[m] & site_id == example_site)
#pull lagged water temp values
#temp_lag <- forecast_total_unc %>%
#filter(forecast_date == forecasted_dates[m-1])
#run model using initial conditions and parameter distributions instead of fixed values, and adding process uncertainty
temp_pred$value <- param_df$beta1 + temp_driv$air_temperature * param_df$beta2 + temp_driv$air_pressure * param_df$beta3 + rnorm(n = n_members, mean = 0, sd = sigma)
#insert values back into the forecast dataframe
forecast_total_unc <- forecast_total_unc %>%
rows_update(temp_pred, by = c("forecast_date","ensemble_member","forecast_variable", "site_id", "uc_type"))
}
temp_driver_forecast <- dplyr::bind_rows(temp_driver_forecast, forecast_driver_unc)
temp_parameter_forecast <- dplyr::bind_rows(temp_parameter_forecast, forecast_parameter_unc)
temp_process_forecast <- dplyr::bind_rows(temp_process_forecast, forecast_process_unc)
temp_total_forecast <- dplyr::bind_rows(temp_total_forecast, forecast_total_unc)
}
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
## Warning in mod - targets_lm$temperature: longer object length is not a multiple
## of shorter object length
Let’s visualize the time series of all different uncertainties for all the sites.
#Plot the forecasted water temperature for all different uncertainties
ggplot()+
geom_line(data = temp_driver_forecast, aes(x=forecast_date, y=value, group = ensemble_member, alpha = 0.4)) +
geom_vline(xintercept = as_date(forecast_date), linetype = "dashed") +
labs(title = "Driver uncertainty in water temperature forecast for different NEON sites")+
ylab("water temperature")+
facet_wrap(~site_id, scales = 'free')+
scale_x_date(date_breaks = "1 week", date_labels = "%b %d")
<<<<<<< HEAD
ggplot()+
geom_line(data = temp_parameter_forecast, aes(x=forecast_date, y=value, group = ensemble_member, alpha = 0.4)) +
geom_vline(xintercept = as_date(forecast_date), linetype = "dashed") +
labs(title = "Parameter uncertainty in water temperature forecast for different NEON sites")+
ylab("water temperature")+
facet_wrap(~site_id, scales = 'free')+ scale_x_date(date_breaks = "1 week",date_labels = "%b %d")
<<<<<<< HEAD
ggplot()+
geom_line(data = temp_process_forecast, aes(x=forecast_date, y=value, group = ensemble_member, alpha = 0.4)) +
geom_vline(xintercept = as_date(forecast_date), linetype = "dashed") +
labs(title = "Process uncertainty in water temperature forecast for different NEON sites")+
ylab("water temperature")+
facet_wrap(~site_id, scales = 'free')+scale_x_date(date_breaks = "1 week", date_labels = "%b %d")
<<<<<<< HEAD
ggplot()+
geom_line(data = temp_total_forecast, aes(x=forecast_date, y=value, group = ensemble_member, alpha = 0.4)) +
geom_vline(xintercept = as_date(forecast_date), linetype = "dashed") +
labs(title = "Total uncertainty in water temperature forecast for different NEON sites")+
ylab("water temperature")+
facet_wrap(~site_id, scales = 'free')+scale_x_date(date_breaks = "1 week", date_labels = "%b %d")
<<<<<<< HEAD
#Partition Total Forecast Uncertainty
#Combine all uncertainties into a single dataframe
all_forecast_df <- bind_rows(temp_driver_forecast, temp_parameter_forecast) %>%
bind_rows(., temp_process_forecast)
#Group the dataframe by date and the type of uncertainty included in the forecast, then calculate the standard deviation across all 30 ensemble members for each date and uncertainty type.
sd_df <- all_forecast_df %>%
group_by(forecast_date, uc_type) %>%
summarize(sd = sd(value, na.rm = TRUE), .groups = "drop")
Plot the contribution of each type of uncertainties to total forecast uncertainty.
ggplot() +
geom_bar(data = sd_df, aes(x = forecast_date, y = sd, fill = uc_type), stat = "identity", position = "stack") +
ylab("Standard Deviation (\u00B0C)") +
xlab("Forecasted date") +
scale_fill_manual(values = c("process" = "orange", "parameter" = "black", "initial_conditions" = "blue","driver" = "purple")) +
scale_x_date(date_breaks = "1 day", date_labels = "%b %d") +
labs(fill = "Uncertainty") +
theme_bw(base_size = 12)+
scale_x_date(date_breaks = "1 week")
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
<<<<<<< HEAD